Introduction
Masaki Kobayashi’s Harakiri (1962) is a Japanese film about a samurai who asks to commit ritual suicide at a lord’s palace. Throughout the film, the audience learns the story of what brings the samurai to the palace. At the palace, the samurai argues with and disrespects the lord’s samurai, in revenge for past wrongs. These layers of disrespect lead to conflict, and the main samurai kills many of the lord’s men in combat. The film ends with the lord’s official history, in which the events of the film are manipulated and recorded incorrectly to preserve honor. In contrast, when a person dies in the modern United States, their cause of death and characteristics are meticulously recorded, with substantial effort put into accuracy, not for honor but for statistics. We are here to do those statistics. Cue epic music.
Resources: Website PDF Slides Github
Primary Dataset
The National Bureau of Economic Research creates and distributes a dataset of US mortality for every year since 1959. This dataset is unique for both its breadth and depth. Each row in the dataset represents a single death, and each column represents a different characteristic of the deceased. The information is derived from death certificates filed by medical professionals in the 50 states plus Washington DC. We chose the 2019 edition of the dataset since we did not want to focus on COVID-19. Notable fields in the dataset include education, sex, age classification, day of month, place of death, weekday, manner of death, cause of death, and different risk factors that the deceased had. In 2019, there were 2,861,523 deaths in total. The following are the 10 most common groups when deaths are split up by race, age, education, and sex:
| count | race | age | education | sex |
|---|---|---|---|---|
| 91357 | White | 85-89 years | High school graduate or GED | F |
| 88154 | White | 90-94 years | High school graduate or GED | F |
| 76361 | White | 80-84 years | High school graduate or GED | F |
| 61108 | White | 75-79 years | High school graduate or GED | F |
| 61038 | White | 80-84 years | High school graduate or GED | M |
| 60104 | White | 75-79 years | High school graduate or GED | M |
| 57254 | White | 85-89 years | High school graduate or GED | M |
| 53786 | White | 70-74 years | High school graduate or GED | M |
| 50247 | White | 65-69 years | High school graduate or GED | M |
| 48484 | White | 70-74 years | High school graduate or GED | F |
Secondary Dataset
For our secondary dataset, we are using the Behavioral Risk Factor Surveillance System Survey. This survey includes different free text survey questions from across the United States and territories, with responses broken out by subgroup. There is also information on sample size, percent affirmative response, and confidence interval bounds. We combine the two datasets by matching up subgroups between the death dataset and the risk factor dataset, then use aggregate statistics to analyze how risk factors can be matched with causes of death.
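The subgroup matching described above can be sketched as a keyed join on aggregate rows. This is only an illustration: the subgroup keys, field names, and numbers below are made up for the example and are not taken from either dataset.

```python
# Hypothetical aggregates keyed by (subgroup category, subgroup value).
# None of these keys or numbers come from the real datasets.
deaths_by_subgroup = {
    ("Age Group", "65+"): 1000,
    ("Age Group", "18-24"): 50,
    ("Sex", "F"): 700,
}
risk_by_subgroup = {
    ("Age Group", "65+"): {"question": "Example question A", "pct_yes": 12.0},
    ("Sex", "F"): {"question": "Example question A", "pct_yes": 8.0},
}


def join_on_subgroup(deaths, risks):
    """Inner join: keep only subgroups that appear in both sources."""
    joined = {}
    for key, death_count in deaths.items():
        if key in risks:
            joined[key] = {"deaths": death_count, **risks[key]}
    return joined


matched = join_on_subgroup(deaths_by_subgroup, risk_by_subgroup)
for key, row in matched.items():
    print(key, row)
```

An inner join is used here so that subgroups present in only one source are dropped rather than filled with missing values; whether that is the right choice depends on how the subgroups line up in practice.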
Questions
The main question we are asking is: given someone is dead, can we predict how they died? We will approach this by looking at factors such as age, gender, place of death, educational level, health conditions, and race, and build models for manner of death.
Additionally, we would like to see whether there are any irregularities or trends in mortality when looking at specific factors, such as manner of death across months, deaths by age, deaths by gender, and manner of death by education level. This will inform us about trends in mortality among different demographic groups and should also improve our predictions of how someone died based on the factors they identify with.
Killer Plot
This plot demonstrates the most common manners of death among people in different cross sections of age and marital status. Each body part is scaled by one manner of death: the head by natural causes, the neck by pending investigation, the left arm by accident, the right arm by homicide, the left leg by suicide, and the right leg by could not determine.
The following table displays the average number of records for deaths in the given age ranges. Deaths in most age groups have about 3 records on average. The average is lower for the youngest and oldest age groups and peaks in the early-adult to middle age ranges. The age range with the highest average is 25-34 years, with an average record count of 3.304.
| Age | N | Average Record Count |
|---|---|---|
| Under 1 year | 21012 | 2.181 |
| 1-4 years | 3701 | 2.781 |
| 5-14 years | 5541 | 2.826 |
| 15-24 years | 29979 | 3.010 |
| 25-34 years | 59543 | 3.304 |
| 35-44 years | 83472 | 3.295 |
| 45-54 years | 161212 | 3.219 |
| 55-64 years | 376411 | 3.195 |
| 65-74 years | 557075 | 3.200 |
| 75-84 years | 689088 | 3.149 |
| 85 years and over | 874198 | 2.982 |
| Age not stated | 291 | 2.577 |
The following table displays the average number of records for deaths by marital status. The average record count for all groups is around 3.1. The marital status with the highest average is “Marital status unknown”, with an average record count of 3.257. The range of these averages is only 0.226, so there is very little variation by marital status.
| Marital Status | N | Average Record Count |
|---|---|---|
| Divorced | 478548 | 3.214 |
| Married | 1038238 | 3.134 |
| Never married, single | 403235 | 3.137 |
| Marital status unknown | 23155 | 3.257 |
| Widowed | 918347 | 3.031 |
The following table displays the average number of records for deaths by manner of death. Deaths of most manners have about 3 records on average. The average is much lower for “Pending Investigation”, at 1.311. The highest average is for “Accident”, at 4.005. Every other manner of death yields an average record count close to 3.
| Manner | N | Average Record Count |
|---|---|---|
| Accident | 173608 | 4.005 |
| Suicide | 47764 | 2.930 |
| Homicide | 20310 | 3.236 |
| Pending Investigation | 4484 | 1.311 |
| Could Not Determine | 11800 | 2.884 |
| Natural | 2327811 | 3.028 |
| Not Specified | 275746 | 3.358 |
Exploration
Deaths by Weekday
First, we plotted weekday of death versus death counts. In raw counts, the most deaths occurred on Tuesdays. However, 2019 had an extra Tuesday (53 Tuesdays versus 52 of every other weekday), and an average day has 7839.789 deaths. After adjusting for the extra Tuesday, the most deaths were on Fridays.
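The adjustment for the extra Tuesday can be reproduced with a short calculation. The sketch below counts how often each weekday occurs in 2019 and derives the average deaths per day from the total stated earlier; the function names are illustrative, and the per-weekday raw counts are not listed in this report, so `deaths_per_occurrence` is shown without real input.

```python
from datetime import date, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
TOTAL_DEATHS_2019 = 2_861_523  # total stated in the dataset description


def weekday_occurrences(year):
    """Count how many times each weekday occurs in a calendar year."""
    counts = {name: 0 for name in WEEKDAYS}
    day = date(year, 1, 1)
    while day.year == year:
        counts[WEEKDAYS[day.weekday()]] += 1
        day += timedelta(days=1)
    return counts


def deaths_per_occurrence(raw_counts, occurrences):
    """Divide each weekday's raw death count by how often that weekday occurred."""
    return {name: raw_counts[name] / occurrences[name] for name in raw_counts}


occurrences = weekday_occurrences(2019)
average_per_day = TOTAL_DEATHS_2019 / sum(occurrences.values())
print(occurrences["Tuesday"], round(average_per_day, 3))
```

Dividing by the number of occurrences rather than subtracting one average day is an alternative way to make weekdays comparable; both remove the advantage of the weekday that appears 53 times.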
Deaths by Month
The most deaths occur in the coldest and darkest months of the year, which are February, January, December, and March. Summer months have lower deaths by around 10-11%.
Interestingly, the higher number of deaths in the winter months and the lower number in summer can be entirely explained by deaths from natural causes; deaths from other causes are, if anything, slightly higher in summer. Since most deaths are due to natural causes, even a small relative increase in them can have a large impact on the total number of deaths. A likely reason for the winter increase in natural deaths is that people spend more time inside in cold weather, which leads to increased disease transmission.
| Month | Average Daily Deaths | Average Daily Deaths Excluding Natural |
|---|---|---|
| January | 8331 | 668.5 |
| February | 8333 | 679.6 |
| March | 8239 | 686.3 |
| April | 7860 | 677.7 |
| May | 7659 | 695.4 |
| June | 7534 | 727.1 |
| July | 7415 | 737.7 |
| August | 7349 | 733.8 |
| September | 7442 | 721.5 |
| October | 7683 | 710.1 |
| November | 7986 | 718.3 |
| December | 8277 | 723.1 |
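The claim that natural causes account for the whole seasonal gap can be checked directly from the table above. The sketch below assumes both columns are on the same scale (they appear to be per-day averages), so that natural deaths can be obtained by subtraction; that assumption is ours, not something stated in the data.

```python
# month: (all deaths, deaths excluding natural), copied from the table above.
monthly = {
    "January": (8331, 668.5), "February": (8333, 679.6),
    "March": (8239, 686.3), "April": (7860, 677.7),
    "May": (7659, 695.4), "June": (7534, 727.1),
    "July": (7415, 737.7), "August": (7349, 733.8),
    "September": (7442, 721.5), "October": (7683, 710.1),
    "November": (7986, 718.3), "December": (8277, 723.1),
}

# Assumption: both columns share a scale, so natural = total - non-natural.
natural = {month: total - other for month, (total, other) in monthly.items()}

peak = max(monthly, key=lambda m: monthly[m][0])    # month with most deaths
trough = min(monthly, key=lambda m: monthly[m][0])  # month with fewest deaths

gap_total = monthly[peak][0] - monthly[trough][0]
gap_natural = natural[peak] - natural[trough]
gap_other = monthly[peak][1] - monthly[trough][1]
drop_pct = 100 * gap_total / monthly[peak][0]

print(peak, trough, gap_total, round(gap_natural, 1), round(gap_other, 1))
```

If the gap in natural deaths is at least as large as the gap in all deaths while the gap in other deaths is negative, the seasonal pattern is carried by natural causes alone.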
Deaths by Age
Next, we plotted age versus death counts. Deaths were most prevalent among older age groups, such as those between 70 and 84, although deaths start increasing more quickly at age 60. There is also a spike among those less than 1 day old, whereas deaths are comparatively rare at the ages just beyond the first day. For privacy reasons, the NBER does not release exact ages at death but rather the buckets that the ages fall into. These buckets cover different lengths of time, so we also created a version of the plot rescaled by bucket size. This rescaled plot was then put on a log scale to better showcase the data.
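Rescaling by bucket size amounts to dividing each count by the width of its bucket and then taking a logarithm. As a rough sketch, the example below uses the coarser age ranges from the record-count table earlier in this report rather than the finer buckets in the plot, and it leaves out the open-ended "85 years and over" and "Age not stated" rows because they have no defined width.

```python
import math

# (label, bucket width in years, deaths) from the age table earlier in this report.
buckets = [
    ("Under 1 year", 1, 21012),
    ("1-4 years", 4, 3701),
    ("5-14 years", 10, 5541),
    ("15-24 years", 10, 29979),
    ("25-34 years", 10, 59543),
    ("35-44 years", 10, 83472),
    ("45-54 years", 10, 161212),
    ("55-64 years", 10, 376411),
    ("65-74 years", 10, 557075),
    ("75-84 years", 10, 689088),
]


def rescale(rows):
    """Return (label, deaths per year of age, log10 of that rate) per bucket."""
    result = []
    for label, width, deaths in rows:
        rate = deaths / width
        result.append((label, rate, math.log10(rate)))
    return result


rescaled = rescale(buckets)
for label, rate, log_rate in rescaled:
    print(f"{label:>13}: {rate:10.1f} deaths per year of age (log10 = {log_rate:.2f})")
```

After rescaling, the first year of life stands out much more clearly than in the raw counts, because its bucket is far narrower than the ten-year buckets that follow.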
Deaths by Manner
Here, we plotted the manner of death versus age and counted how many people of a given age died by each manner of death. A key finding of this analysis is that the majority of people die from natural causes, especially those aged 60+ and those less than 1 day old, followed by accidental causes, which span all age groups. This plot may help inform us about the behaviors and activities that people in a given age group commonly engage in that may have led to their manner of death. If we can determine the activities common to each age group, preventative methods can be used to decrease the number of accident-related deaths. The plot helps us answer how manner of death varies among the different age groups, and it motivates further research into which activities actually lead to each manner of death.
Deaths by Age and Gender
In this plot, we show the number of deaths by age range, along with how many men compared to women passed away in each age category. In each bar, the red fill represents the number of women who passed away in that age range, while the blue represents the number of men. The percentage in each bar is the proportion of deaths in that age range that were men. This analysis shows that the majority of people under the age of 80 who pass away are men, as nearly every bar from ages 0-80 shows the proportion of male deaths to be above 50%. This proportion goes down after 80 years of age, likely because women tend to live longer, so more women than men survive to die at those ages. This plot may help inform us about the differences between men’s and women’s lives and life expectancies. Further research into differences in lifestyle choices between men and women may help better explain why women tend to live longer than men. Furthermore, this plot, accompanied by a plot of cause of death by gender, may help determine which potentially riskier behaviors men partake in during their lifetimes that lead to an earlier death than women.
Cause of Death by Education
For most causes of death, level of education does not have an impact on what proportion of people have that cause of death. The largest difference belongs to “Certain conditions originating in the perinatal period” with high occurrences in those with 8th grade or less education and those with unknown education and nearly no occurrences in all others. Another large proportion difference is in “Congenital malformations” where 8th grade or less has a much higher mortality proportion than other education levels. For causes of death that are not highly tied to conditions at birth, “Syphilis” and “Assault (homicide)” have the highest differing proportions. “Syphilis” has a quite small sample size but unknown education has the highest mortality proportion. For “Assault (homicide)”, 9 - 12th grade, no diploma has the highest mortality proportion.
Analysis
Free Text Analysis: Selected Health Conditions by Age Group
Using regular expressions, we searched the BRFSS dataset for questions related to heart conditions, cancer, and depression, grouping by the age of respondents. Furthermore, we restricted entries to those where participants responded positively, indicating that they did have those conditions. Interestingly, all age groups except 65+ had a high incidence of depression, hovering around the 20% mark. This dips significantly to 15% for the 65+ age group, perhaps because mental health was more stigmatized during their lives and psychological diagnoses were less readily available. In addition, the orange and olive green lines (for heart attack and coronary heart disease, respectively) overlap to a significant degree, which makes sense given how closely related the conditions are. The green line (corresponding to non-skin cancer) is slightly higher than the blue line (corresponding to skin cancer) for all age groups. Furthermore, both cancer lines have positive slopes, indicating that as age increases, a cancer diagnosis becomes more likely.
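The regular-expression filtering described above might look roughly like the following. This is a sketch rather than the code we ran: the patterns, the field names (`question`, `age_group`, `response`), and the sample rows are all hypothetical stand-ins for the real BRFSS columns and question text.

```python
import re

# Hypothetical patterns for the three condition families.
CONDITION_PATTERNS = {
    "heart": re.compile(r"heart attack|coronary heart disease", re.IGNORECASE),
    "cancer": re.compile(r"cancer", re.IGNORECASE),
    "depression": re.compile(r"depress", re.IGNORECASE),
}


def match_conditions(question):
    """Return the condition families whose pattern appears in the question text."""
    return [name for name, pattern in CONDITION_PATTERNS.items()
            if pattern.search(question)]


def select_positive(rows):
    """Keep rows about a tracked condition where the response was affirmative."""
    selected = []
    for row in rows:
        conditions = match_conditions(row["question"])
        if conditions and row["response"] == "Yes":
            selected.append({**row, "conditions": conditions})
    return selected


# Made-up rows for illustration only.
sample_rows = [
    {"question": "Ever told you had a heart attack?", "age_group": "65+", "response": "Yes"},
    {"question": "Ever told you had a heart attack?", "age_group": "65+", "response": "No"},
    {"question": "Ever told you have a form of depression?", "age_group": "18-24", "response": "Yes"},
    {"question": "Do you have health care coverage?", "age_group": "18-24", "response": "Yes"},
]

selected = select_positive(sample_rows)
for row in selected:
    print(row["age_group"], row["conditions"])
```

Matching on question text keeps the filter independent of any question identifiers, at the cost of needing patterns broad enough to catch differently worded questions.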
Logistic Classification
The first model we tried was logistic classification. We created an indicator variable for each manner of death, then fit a separate logistic regression model to each indicator using age, record count, sex, education, place of death, cause of death, and race as predictors. Below is a plot of the McFadden R² values, which demonstrates that the logistic regression models were much more effective for manners of death like suicide and homicide than for others like not specified and natural.
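For reference, McFadden's R² compares the log-likelihood of a fitted model with that of an intercept-only model: R² = 1 − ln L(model) / ln L(null). The sketch below builds a one-vs-rest indicator and computes the statistic from predicted probabilities; it does not fit a model, and the function names and toy inputs are illustrative only.

```python
import math


def make_indicator(manners, target):
    """One-vs-rest indicator: 1 where the manner equals the target, else 0."""
    return [1 if manner == target else 0 for manner in manners]


def log_likelihood(y_true, p_pred, eps=1e-12):
    """Bernoulli log-likelihood, clipping probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total


def mcfadden_r2(y_true, p_pred):
    """1 - lnL(model) / lnL(null), where the null model predicts the base rate."""
    if not 0 < sum(y_true) < len(y_true):
        raise ValueError("both classes must be present to define the null model")
    base_rate = sum(y_true) / len(y_true)
    ll_null = log_likelihood(y_true, [base_rate] * len(y_true))
    ll_model = log_likelihood(y_true, p_pred)
    return 1 - ll_model / ll_null


# Toy example with made-up labels and predicted probabilities.
manners = ["Suicide", "Natural", "Natural", "Accident"]
y = make_indicator(manners, "Suicide")
print(round(mcfadden_r2(y, [0.9, 0.1, 0.1, 0.1]), 3))
```

A value of 0 means the model does no better than always predicting the base rate, and values closer to 1 mean the predicted probabilities sit close to the observed outcomes.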
Decision Tree & Random Forest
The decision tree predicts with 82.11% accuracy and the random forest predicts with 88.89% accuracy. In the decision tree, each node contains a yes-or-no test, which is either a factor level or a logical comparison on a continuous variable. Each leaf contains the classified manner of death and 7 numbers, which are the counts of each manner of death among the records classified to that leaf. In order, the numbers are accident, suicide, homicide, pending investigation, could not determine, natural, and not specified.
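To make the leaf description concrete, the following sketch reads a leaf's 7 counts in the order listed above, picks the majority manner, and reports how pure the leaf is. The leaf counts are invented for illustration, and the accuracy computed here is over the records in the leaves themselves, which is not necessarily how the accuracies reported above were measured.

```python
# Order of the 7 counts in each leaf, as described above.
MANNERS = ["Accident", "Suicide", "Homicide", "Pending Investigation",
           "Could Not Determine", "Natural", "Not Specified"]


def summarize_leaf(counts):
    """Return the majority manner in a leaf and the share of records it covers."""
    if len(counts) != len(MANNERS):
        raise ValueError("expected one count per manner of death")
    total = sum(counts)
    if total == 0:
        raise ValueError("leaf has no records")
    best = max(range(len(counts)), key=lambda i: counts[i])
    return MANNERS[best], counts[best] / total


def leaf_accuracy(leaves):
    """Share of records that fall in the majority class of their leaf."""
    correct = sum(max(counts) for counts in leaves)
    total = sum(sum(counts) for counts in leaves)
    return correct / total


# Hypothetical leaves; these counts are not from our fitted tree.
example_leaves = [
    [5, 1, 0, 0, 0, 94, 0],
    [30, 10, 5, 0, 5, 0, 0],
]

for counts in example_leaves:
    print(summarize_leaf(counts))
print(leaf_accuracy(example_leaves))
```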
Ultimately, we built a model that can effectively classify a manner of death given someone is dead. Additionally, we derived interesting insights from the data and how deaths are associated with weekdays, marital status, time of year, record counts and more.